Method for creating and accessing a menu for audio content
without using a display

This invention relates to an audio management system that
allows a user to browse through stored audio files in a very
natural way. The invention concerns large-capacity digital
storage-playback systems for audio content like MPEG audio
layer 3 (MP3) players.

Background 

Driven by recent advances in the technologies of digital
storage and audio compression, the problem of managing very
big collections of audio files has become predominant. For
instance, the current generation of MP3 players contains a
10 GB hard disk drive, which enables users to store e.g. more
than 300 hours of MP3PRO music, meaning more than 4,000 titles.
Reliable tools are required to make those collections
accessible to the users.

The classical way of indexing audio files is based on textual
meta-information like title, artist, album or genre, e.g.
ID3 tags for MP3 audio files.

There are some drawbacks with this kind of organization:

1. The metadata are textual, not audio, and therefore cannot
   give a precise representation of audio content in the way
   a representative extract of the content can.
2. Organisation sorted by genre or by artist allows users to
   locate a particular piece of music. This presupposes that
   users have well-defined goals, knowing exactly what they
   want to hear. The user's search strategy must be
   goal-driven and deterministic.
3. There are many genres. For instance, the music archive
   mp3.com currently lists its titles under 180 different
   sub-genres, organized in 16 main genres. It is difficult
   for a user to navigate such an organization.
4. Genres are sometimes subjective because they are
   established a priori and not deduced from the content
   itself. Sometimes they are difficult to interpret.
5. A classification by genres is not able to satisfy very
   simple user needs, for instance: "This piece of music is
   relaxing me. I would like to hear more like this".

The present invention is directed to overcoming these
drawbacks.



Invention 

The present invention deals with a process and system for
navigating through a large amount of audio files, e.g. MP3
files, using brief representatives of the audio content.
Before a user selects a music track, he can benefit from
hearing a brief representative excerpt, in the following
referred to as "audio thumbnail". An audio thumbnail is of
sufficient length to recognize the music, e.g. 5 or 6 seconds.

The stored audio files are preprocessed in order to extract
some relevant and objective descriptors. According to the
invention, these descriptors are used to cluster the music
tracks into perceptually homogeneous groups. From each cluster
a relevant track is selected automatically, manually, or
semi-automatically, and from said selected track an audio
thumbnail is extracted. Then these audio thumbnails, being key
phrases, are arranged in a tree data structure, or table of
contents, that enables the user to navigate without any visual
navigation means, like a display.

Furthermore, the audio thumbnails allow the user to navigate
perceptually through the audio database, without having to
remember textual elements like title or artist names. This is
particularly suited to enable users without a precise idea of
what they want to hear to browse their database, and to select
perceptually from clusters of songs. Perceptually means here
that the thumbnails address the perception, and not the memory,
of users. Also, said clusters are perceptive, meaning that the
structuring of the clusters is relevant for users, and thus
said structuring meets real user needs.

Using this invention, users can create play lists beyond the
classical music categories like pop or country.



Brief description of the drawings

Exemplary embodiments of the invention are described with
reference to the accompanying drawings, which show in

Fig. 1 an exemplary architecture of an audio reproduction
system using an audio menu;

Fig. 2 an exemplary user interface without display.


Detailed description of the invention

The present invention describes a method for creating,
organizing and using audio representations for audio content.
The structure of the invention is shown in Figure 1. Audio
tracks, usually music, are stored in a storage means S. The
tracks are classified in a classifier CL and associated to a
cluster of tracks C1, C2, C3. For each cluster a representative
example is selected in a representative selector R. Further,
an extractor X extracts a characteristic sample, or thumbnail,
from said example, and the thumbnail is associated to a table
of contents T. The user uses an interface I to select a first
cluster represented by a first thumbnail, listen to the
selected thumbnail, and decide whether to select another
cluster, or select said first cluster related to said first
thumbnail and then select a track belonging to said cluster,
which track is then read from the storage means S and
reproduced.

Advantageously, this approach is more perception-based than
previous methods, and therefore more convenient to the user.
An audio-based indexing system according to the invention
combines two methods that are known from other content search
systems, namely the 'table-of-contents' method and the
'radio-like navigation' method.



The 'table-of-contents' method relates to a book's table of
contents, where short representative sequences summing up the
actual text are grouped according to the structure of the
book's contents. This usually correlates with a logical
classification into topics. Using this method for audio
content means extracting parameters, or descriptors, from the
audio file, following objective criteria defined below, and
then grouping together homogeneous tracks in clusters. From a
user's point of view, these clusters make sense because their
content-based nature goes further than the a priori
classification in genres. E.g. all the fragments of guitar
music, coming from all genres, may be grouped together in a
cluster. All the relaxing music can constitute another
cluster. According to the invention, the different clusters
constitute the "table of contents" of the database. And, like
in a book's table of contents, there may be different levels
of detail, like e.g. chapter 1, chapter 1.1, etc. Just as the
reader can navigate from chapter to chapter, and may decide to
read a chapter in more detail, the listener can navigate from
cluster to cluster, or may decide to listen to more, similar
music from a cluster.

The 'radio-like navigation' method relates to typical user
behavior when listening to the radio. Content browsing in this
context is e.g. the user scanning the FM band on a car radio,
and listening to a track or switching to the next station. The
invention uses this concept, with a radio station
corresponding to a cluster of tracks. Then 'switch to another
station' corresponds to 'select another cluster', and 'listen
to a track' corresponds to 'listen to this track or to a
similar track from the same cluster'.






In the following, the aforementioned steps in creating and
organizing audio representations are described in detail, the
steps being performed when a track is newly added to the
database, or when the database is reorganized.

In a first step, descriptors are extracted from the audio
track. Three types of descriptors are used, trying to be
objective and still relevant for the user.






The first type of descriptors is low-level descriptors, or
physical features, as are typical for signal processing
methods. Examples are spectral centroid, short-time energy or
short-time average zero-crossing.
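
As an illustration only, the three low-level descriptors named
above could be computed for one frame of samples roughly as
follows. This is a minimal sketch using textbook definitions and
a naive DFT; the function names, the frame length and the test
signal are assumptions and are not taken from the invention.

```python
import cmath
import math


def short_time_energy(frame):
    """Mean squared amplitude of one frame of samples."""
    return sum(x * x for x in frame) / len(frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


def spectral_centroid(frame, sample_rate):
    """Magnitude-weighted mean frequency in Hz, from a naive DFT."""
    n = len(frame)
    weighted, total = 0.0, 0.0
    for k in range(n // 2 + 1):  # non-negative frequencies only
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * i / n)
            for i, x in enumerate(frame)
        )
        magnitude = abs(coeff)
        weighted += (k * sample_rate / n) * magnitude
        total += magnitude
    return weighted / total if total else 0.0


# Hypothetical test signal: a 1 kHz sine sampled at 8 kHz, 64 samples.
SAMPLE_RATE = 8000
frame = [math.sin(2 * math.pi * 1000 * i / SAMPLE_RATE) for i in range(64)]
energy = short_time_energy(frame)
zcr = zero_crossing_rate(frame)
centroid_hz = spectral_centroid(frame, SAMPLE_RATE)
```

For the pure tone used here the centroid lies close to the tone
frequency; real music frames would give a broader spectrum.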

The second type of descriptors is medium-level descriptors, or
perceptual features, as are typically used by a musician.
Examples are rhythm, e.g. binary or ternary rhythm, tonality,
the kind of formation, e.g. voice or special instruments.



The third type of descriptors is high-level descriptors, or
psychological and social features of the track, as are
normal for the average user. Trying to minimize the subjective
nature of these features, it is e.g. possible to classify
music as being happy or anxious, calm or energetic. These
characteristics can be assigned to a certain degree, or with a
certain probability, to a piece of music, when e.g.
descriptors of the previously described types are used. Also,
a song can be highly memorable, can convey a certain mood or
emotion, can remind the user of something, etc. This may be
done automatically using supervised algorithms, i.e. with
algorithms that require user interaction.



The second step consists of clustering the music tracks. Using
the descriptors defined in the first step, the tracks can be
classified into homogeneous classes. These classes are more
valuable to a user than classifying music by artist or title.
Unsupervised algorithms may be used to cluster the tracks into
packets with similar properties. Examples of such algorithms
are K-means or Self-Organising Maps. A new cluster may be
automatically generated when the dissimilarity of a newly
added track, compared to existing clusters, reaches a certain
minimum level, and in that case the newly added track will be
associated with the new cluster.
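
The rule for opening a new cluster can be sketched as follows.
Note that this is a simple incremental nearest-centroid
assignment, used here for brevity in place of K-means or
Self-Organising Maps; the Euclidean distance, the threshold
value and the two-dimensional descriptor vectors are
assumptions made for the example.

```python
import math


def distance(a, b):
    """Euclidean distance between two descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def add_track(clusters, vector, new_cluster_threshold):
    """Assign a vector to the nearest cluster, or open a new one.

    clusters is a list of lists of descriptor vectors. Returns the
    index of the cluster that received the vector.
    """
    best_index, best_dist = None, math.inf
    for index, members in enumerate(clusters):
        d = distance(vector, centroid(members))
        if d < best_dist:
            best_index, best_dist = index, d
    # Too dissimilar to every existing cluster: start a new cluster.
    if best_index is None or best_dist >= new_cluster_threshold:
        clusters.append([vector])
        return len(clusters) - 1
    clusters[best_index].append(vector)
    return best_index


# Hypothetical descriptors, scaled to the range 0..1.
clusters = []
tracks = ([0.1, 0.1], [0.15, 0.05], [0.9, 0.9], [0.85, 0.95], [0.12, 0.08])
assignments = [add_track(clusters, v, new_cluster_threshold=0.5) for v in tracks]
```

With these values the third track is far from the first cluster
and therefore opens a second one.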

At this point, the tracks are classified and therefore it is
possible to create a table of contents. There is no sharp
classification required, e.g. it is possible to have the same
track in any number of clusters. For example, one cluster may
be for guitar music, while another cluster may be for calm
music, and a track matching both characteristics may be
associated with both clusters. In this case, both clusters may
contain a link to said audio track, but the track itself needs
to be stored only once.

The third step consists of automatically selecting a
representative track for each cluster. Advantageously, the
most representative track for a cluster is selected, using
classical medoid selection. A medoid is that object of a
cluster whose average dissimilarity to all objects of the
cluster is minimal. Said dissimilarity can e.g. be determined
using the descriptors that were extracted during the first
step.
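
Medoid selection as defined above can be sketched in a few
lines. The Euclidean dissimilarity, the track identifiers and
the descriptor values are assumptions chosen for illustration.

```python
import math


def dissimilarity(a, b):
    """Euclidean distance between two descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def medoid(cluster):
    """Return the id of the track with minimal average dissimilarity.

    cluster maps a track id to its descriptor vector.
    """
    def average_dissimilarity(track_id):
        return sum(
            dissimilarity(cluster[track_id], other)
            for other in cluster.values()
        ) / len(cluster)

    return min(cluster, key=average_dissimilarity)


# Hypothetical cluster of four tracks with two descriptors each.
cluster = {
    "track_a": [0.10, 0.20],
    "track_b": [0.15, 0.25],
    "track_c": [0.60, 0.70],
    "track_d": [0.20, 0.20],
}
representative = medoid(cluster)
```

Unlike a centroid, the medoid is always an actual track of the
cluster, which is what makes it usable as a source for an audio
thumbnail.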

In the fourth step an audio thumbnail is created and stored
for the medoid track. In another embodiment of the invention
an audio thumbnail may be created and stored also for other
tracks. For thumbnail creation it is evaluated which criteria
are the best to characterize an audio track by a short audio
sequence, the audio sequence being long enough to recognize
the track, e.g. 5 or 6 seconds. In one embodiment of the
invention the length of thumbnails is constant, in a second
embodiment the length of thumbnails can be modified, and in a
third embodiment the length of thumbnails can vary from track
to track, according to the tracks' descriptors. Further, in one
embodiment of the invention a thumbnail is an original sample
from the track, or in another embodiment it is automatically
synthesized from said track.
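
The text leaves open which criteria characterize a track best.
Purely as an assumed stand-in, the sketch below takes the
fixed-length window with the highest energy as the thumbnail,
i.e. an original sample from the track. The function name, the
hop size and the toy signal are hypothetical.

```python
def extract_thumbnail(samples, sample_rate, seconds=5.0, hop_seconds=1.0):
    """Return (start_index, excerpt) for the highest-energy window."""
    window = int(seconds * sample_rate)
    hop = max(1, int(hop_seconds * sample_rate))
    if len(samples) <= window:
        return 0, list(samples)  # track is shorter than one thumbnail
    best_start, best_energy = 0, -1.0
    for start in range(0, len(samples) - window + 1, hop):
        energy = sum(x * x for x in samples[start:start + window])
        if energy > best_energy:
            best_start, best_energy = start, energy
    return best_start, samples[best_start:best_start + window]


# Hypothetical toy track: 20 seconds at 10 samples per second,
# quiet except for a loud passage from second 8 to second 13.
SAMPLE_RATE = 10
track = [0.1] * 80 + [1.0] * 50 + [0.1] * 70
start, thumbnail = extract_thumbnail(track, SAMPLE_RATE, seconds=5.0)
```

A variable thumbnail length, as in the third embodiment, would
correspond to choosing the seconds argument per track.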

In the fifth step the audio thumbnails are listed in a virtual
table, which can be scanned through by the user, like scanning
through different radio stations. The table may be organized
such that within a cluster the most relevant track, or medoid,
will be found first when scanning through the table. Other
tracks within a cluster may be sorted, e.g. according to
relevance. Advantageously, no graphical or textual display is
required for scanning through the table of contents. The
structure of the table of contents may be as follows:
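
The original layout of the table is not reproduced in this
text. As a sketch under assumptions, a one-level table of
contents could be held in a structure like the one below, with
each cluster listing its tracks and thumbnails and the most
relevant track first. All class names, field names and
identifiers are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    track_id: str       # key of the full track in the storage means
    thumbnail_id: str   # key of the stored audio thumbnail
    relevance: float    # higher means closer to the cluster's character


@dataclass
class Cluster:
    name: str
    entries: list = field(default_factory=list)

    def ordered(self):
        """Entries sorted so that the most relevant track comes first."""
        return sorted(self.entries, key=lambda e: e.relevance, reverse=True)


table_of_contents = [
    Cluster("cluster_1", [Entry("t3", "th3", 0.7), Entry("t1", "th1", 0.95)]),
    Cluster("cluster_2", [Entry("t7", "th7", 0.9), Entry("t5", "th5", 0.4)]),
]
scan_order = [[e.track_id for e in c.ordered()] for c in table_of_contents]
```

Scanning the table then means stepping through the clusters and,
within a cluster, through its ordered entries.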



A user may decide to listen to the current track, or to
another track belonging to the same cluster and therefore
being similar to said current track. Alternatively the user
may decide to listen to a track from another cluster.
Advantageously, only one button, or other means of command
input, is required to operate the navigation system, namely
for 'Switch Cluster'. More comfortable to the user is a device
with three buttons, as shown in Figure 2. One button SD is for
'Switch to a Near Cluster', one button SU is for 'Switch to a
Distant Cluster', and one button M is for 'Switch to another
track from the Current Cluster'. Alternatively, it is
sufficient to have only one button, if the button has more
than one function, or other means of user input. Other
functions controlled by user input could be e.g. random track
selection or random cluster selection mode. Another function
could be to successively reproduce the representatives of all
clusters until the user selects one cluster, said function
being advantageous because the user need not manually scan
through the table of contents.
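
The three-button interface can be modelled as a small state
machine. The sketch assumes that 'near' means the cluster with
the smallest dissimilarity to the current one and 'distant' the
one with the largest; the text does not fix this. The class
name, the cluster names and the distance values are
hypothetical.

```python
class AudioMenu:
    """Display-free navigation over clusters with buttons SD, SU and M."""

    def __init__(self, clusters, distances):
        self.clusters = clusters    # cluster name -> track ids, medoid first
        self.distances = distances  # frozenset({a, b}) -> dissimilarity
        self.current = next(iter(clusters))
        self.position = 0

    def _distance_to(self, other):
        return self.distances[frozenset((self.current, other))]

    def press(self, button):
        """Handle one button press and return the track id to play."""
        others = [name for name in self.clusters if name != self.current]
        if button == "SD":      # switch to a near cluster
            self.current = min(others, key=self._distance_to)
            self.position = 0
        elif button == "SU":    # switch to a distant cluster
            self.current = max(others, key=self._distance_to)
            self.position = 0
        elif button == "M":     # another track from the current cluster
            tracks = self.clusters[self.current]
            self.position = (self.position + 1) % len(tracks)
        else:
            raise ValueError(f"unknown button: {button}")
        return self.clusters[self.current][self.position]


clusters = {
    "calm": ["calm_1", "calm_2"],
    "guitar": ["guitar_1", "guitar_2"],
    "energetic": ["energetic_1", "energetic_2"],
}
distances = {
    frozenset(("calm", "guitar")): 0.3,
    frozenset(("calm", "energetic")): 0.9,
    frozenset(("guitar", "energetic")): 0.5,
}
menu = AudioMenu(clusters, distances)
played = [menu.press(b) for b in ("M", "SD", "SU", "M")]
```

Each switch of cluster starts at position 0, so the user first
hears the representative of the newly selected cluster.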


Further embodiments are described in the following. 

In one embodiment of the invention an audio track belongs to
only one cluster, while in another embodiment an audio track
may belong to more than one cluster, when the respective class
criteria do not exclude each other.

In one embodiment of the invention the table of contents has 
only one level of clustering, like in the previously described 
example, while in another embodiment the table of contents can 
have more hierarchical levels of clusters. 

In one embodiment of the invention the classification rules
for audio tracks are final, while in another embodiment said
rules may be modified. Said modification may happen either by
an update, e.g. via internet, or by any form of user
interaction, e.g. upload to PC, edit and download from PC, or
by statistical or self-learning methods as used e.g. by
artificial intelligence. This may be implemented such that an
automatic or semi-automatic reclassification with modified or
enhanced rules may be performed when e.g. the number of tracks
associated with one cluster is much higher than the number of
tracks associated with any other cluster.
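
The imbalance condition that may trigger a reclassification
can be sketched as a simple check on cluster sizes. The text
only says 'much higher'; the ratio of 3 used here, like the
function name, is an assumption.

```python
def needs_reclassification(cluster_sizes, ratio_limit=3.0):
    """True when the largest cluster dwarfs every other cluster.

    cluster_sizes maps a cluster name to its number of tracks.
    """
    sizes = sorted(cluster_sizes.values(), reverse=True)
    if len(sizes) < 2:
        return False  # nothing to compare against
    largest, runner_up = sizes[0], sizes[1]
    # Exceeding the second-largest cluster by the ratio implies
    # exceeding all smaller clusters as well.
    return largest >= ratio_limit * max(runner_up, 1)


unbalanced = needs_reclassification({"a": 120, "b": 30, "c": 25})
balanced = needs_reclassification({"a": 40, "b": 30})
```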

In one embodiment of the invention thumbnails may be created
only for tracks representing a cluster. In another embodiment
of the invention thumbnails may be created also for other
tracks, e.g. tracks that fulfill a certain condition like
being selected very often or very rarely, or being very long.
In a third embodiment thumbnails are created for all tracks.

In one embodiment of the invention the tracks within a cluster
may have a constant order, so that the user can learn after a
while when a certain track comes. The order can follow the
tracks' relevance, or any other parameter, e.g. storage time,
or frequency of selection. In another embodiment of the
invention the tracks within a cluster may be unordered, or
appear randomly when the user selects a cluster.

In one embodiment of the invention there is a representative
track selected for each cluster, while in another embodiment
it may be useful to have no representative track for one of
said clusters, e.g. a cluster for favorites or a cluster for
tracks not being classifiable by the employed methods.

Advantageously, the described method for perception-based
classification and retrieval of audio contents can be used in
devices, preferably portable devices, for storage and
reproduction of music or other audio data, e.g. MP3 players.


Claims



1. A method for creating and accessing a menu for audio
content stored in a storage means (S), the content
consisting of audio tracks, and the menu containing
representations of said audio tracks, characterised in

- classifying (CL) the audio tracks into groups, or
clusters (C1,...,C3), wherein said classification is
performed according to characteristic parameters of
said audio tracks;

- selecting (R) automatically an audio track being a
representative for the cluster, wherein said selection
is performed according to characteristic parameters of
said audio track and of the other audio tracks of said
cluster;

- generating (X) as said representation a reproducible
audio extract from said representative audio track; and

- associating said audio extract to a menu list (T).



2. Method according to claim 1, wherein said characteristic
parameters used for classification of audio content
comprise one or more audio descriptors, the audio
descriptors being either physical features, or perceptual
features, or psychological or social features of the
audio content.

3. Method according to any of claims 1-2, wherein an audio
track can be classified into more than one cluster
(C1,...,C3).

4. Method according to any of claims 1-3, wherein the audio
tracks within a cluster (C1,...,C3) have variable order, so
that the user listens to a randomly selected track when
having selected a cluster (C1,...,C3), with said track
belonging to said cluster.

5. Method according to any of claims 1-4, wherein a user can
modify the result of automatic classification of audio
tracks.

6. Method according to any of claims 1-5, wherein a user can
modify the classification rules for automatic
classification of audio tracks.






7. Method according to any of claims 1-6, wherein the actual
audio data are clustered within said storage means (S)
according to said menu.

8. Method according to any of claims 1-7, wherein the audio
extract is a sample from the audio track, or an audio
sequence being synthesized from the actual audio track.

9. Method according to any of claims 1-8, wherein audio
extracts are created additionally for audio tracks not
being representatives of clusters.

10. Method according to any of claims 1-9, wherein the length 
of audio extracts is not predetermined. 

11. Method according to any of claims 1-10, wherein one of
said clusters has no representative track.

12. Method according to any of claims 1-11, wherein said menu
is hierarchical, such that a cluster may contain one or
more subclusters.

13. Method according to any of claims 1-12, wherein the
classification rules are modified automatically if a
defined precondition is detected, and a reclassification
may be performed.

14. Method according to claim 13, wherein said precondition
comprises that the difference between the number of
tracks in a cluster and the number of tracks in another
cluster reaches a maximum limit value.

15. Method according to claim 13, wherein said precondition
comprises that all stored tracks were classified into one
cluster, and the total number of tracks reaches a maximum
limit value.

16. An apparatus for creating or accessing a menu for audio
content stored on a storage means (S), the content
consisting of audio tracks, and the menu containing
representations of audio tracks, characterised by

- means for automatically classifying (CL) the audio
tracks into groups, or clusters (C1,...,C3), wherein said
classification is performed according to characteristic
parameters of said audio tracks;

- means for automatically selecting (R) an audio track
being a representative for the cluster, wherein said
selection is performed according to characteristic
parameters of said audio track and of the other audio
tracks of said cluster;

- means for generating (X) a reproducible audio extract
from said representative audio track; and

- means for associating said audio extract to a menu list

17. Apparatus according to claim 16, further characterised by

- means for selecting and reproducing a first audio
representation from a first cluster;

- means for a first user input (M,SU,SD), the input
controlling whether the cluster associated with the
currently selected audio thumbnail is selected or not;
and

- means for a second user input (M,SU,SD), the input
controlling whether another cluster is selected or not.

18. Apparatus according to any of claims 16 or 17, further
characterised in that an audio track of the selected
cluster is read from said storage means (S) for playback.


Abstract

A method for creating a menu (T) for audio content, e.g. music
tracks, uses means (CL) for classifying the audio content into
clusters (C1,...,C3) of similar tracks, the similarity referring
to physical, perceptual and psychological features of the
tracks. The method comprises a means (R) for automatic
representative selection for clusters (C1,...,C3), and a means
(X) for generating thumbnail representations of audio tracks.
Said audio thumbnails are associated to the menu (T).
Advantageously, no graphical or textual display is required
for navigation, since the user may listen to an audio
thumbnail and then enter a command, e.g. by pressing an
appropriate button, for either listening to the related track
or a similar track belonging to the same cluster, or listening
to another type of music by selecting another thumbnail
representing another cluster.

Fig. 1

Fig. 2



